\documentclass[fleqn]{article}

% start meta data
\title{The Effect of Scaling and Mean Centering Prior to a Principal Component Analysis}
\author{Sebastian Raschka\\ \texttt{se.raschka@gmail.com}} 
\date{\today}
% end meta data

\usepackage{amsmath}

% start header and footer
\usepackage{fancyhdr}
\pagestyle{fancy}
\lhead{Sebastian Raschka}
\rhead{Centering and scaling variables for PCA}
\cfoot{\thepage} % centered footer
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0.4pt}
% end header and footer




\begin{document} % start main document

\maketitle % makes the title page from meta data



\noindent A frequently asked question is whether data needs to be centered prior to dimensionality reduction via \emph{Principal Component Analysis (PCA)} or not. Here, the assumption is that the PCA is computed based on the \emph{covariance matrix}, i.e., the $k$ principal components are the eigenvectors of the covariance matrix that correspond to the $k$ largest eigenvalues.



\section{Mean Centering Does not Affect the Covariance Matrix}
\label{meancenteringdoesnotaffectthecovariancematrix}

Under the assumption that the PCA is obtained from covariance matrix, the resulting principal components will be the same regardless of whether mean centering was performed or not as long as the covariance matrix stays the same. The following equations will show that \emph{mean centering} does not affect the covariance matrix.
\noindent Let  $\bf{x}$ and $\bf{y}$  be two random variables so that  the covariance between the attributes is calculated as

\begin{equation} \sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})   \end{equation}

\noindent The centered variables can be written as

\begin{equation} x' = x - \bar{x} \text{ and } y' = y - \bar{y} \end{equation}

\noindent And the centered covariance can be calculated as follows:

\begin{equation} \sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (x_i' - \bar{x}')(y_i' - \bar{y}')   \end{equation}

\noindent Mean centering results in samples means of 0, $\bar{x}' = 0$ and $\bar{y}' = 0$, thus

\begin{equation} \sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} x_i' y_i'   \end{equation} 


\noindent This is equal original covariance matrix, which can be shown by resubstitution:

\begin{equation} x' = x - \bar{x} \text{ and } y' = y - \bar{y} \end{equation}

\noindent Even the centering of only one variable,  e.g., $\bf{x}$,  would leave the covariance matrix unaffected. 

\begin{equation} \sigma_{\text{xy}} = \frac{1}{n-1} \sum_{i}^{n} (x_i' - \bar{x}')(y_i - \bar{y})   \end{equation}
\begin{equation}  =  \frac{1}{n-1} \sum_{i}^{n} (x_i' - 0)(y_i - \bar{y})   \end{equation}
\begin{equation}  =  \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})   \end{equation}



\section{Scaling Does Affect the Covariance Matrix}
\label{scalingofvariablesdoesaffectthecovariancematrix}

However, in contrast to mean centering, \emph{scaling} (e.g, pounds into kilogram where 1 pound = 0.453592 kg) \textbf{does} have an effect on the covariance matrix and therefore influences the results of a PCA.

\noindent Let $c$ be the scaling factor for $\bf{x}$. Given that the ``original'' covariance is calculated as

\begin{equation} \sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})   \end{equation}

\noindent the covariance after scaling is calculated as follows:

\begin{equation} \sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (c \cdot x_i - c \cdot  \bar{x})(y_i - \bar{y})   \end{equation}
\begin{equation} =  \frac{c}{n-1} \sum_{i}^{n} (x_i -   \bar{x})(y_i - \bar{y})   \end{equation}

\begin{equation} \Rightarrow \sigma_{xy} = \frac{\sigma_{xy}'}{c} \end{equation}
\begin{equation} \Rightarrow \sigma_{xy}' = c \cdot \sigma_{xy} \end{equation}

\noindent Therefore, the scaling of one attribute by a constant $c$ will result in a rescaled covariance $c \sigma_{xy}$. E.g., if an attribute $\bf{x}$ measured in pounds is rescaled  into kilograms, the covariance between $\bf{x}$ and $\bf{y}$ will be 0.453592 times less the original value.



\section{Standardizing Does Affect the Covariance Matrix}
\label{standardizingaffectsthecovariance}

Standardization of features also has an effect on the outcome of a PCA since standardization can be understood as a scaling the covariance between every pair of variables by the product of the standard deviations of each pair of variables.

\noindent The equation for standardization of a variable is written as 

\begin{equation} z = \frac{x_i - \bar{x}}{\sigma} \end{equation}

The ``original'' covariance matrix:

\begin{equation} \sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})   \end{equation}

And after standardizing both variables:

\begin{equation} x' = \frac{x - \bar{x}}{\sigma_x} \text{ and } y' =\frac{y - \bar{y}}{\sigma_y} \end{equation}

\begin{equation} \sigma_{xy}' =  \frac{1}{n-1} \sum_{i}^{n} (x_i' - 0)(y_i' - 0)   \end{equation}

\begin{equation}  =  \frac{1}{n-1} \sum_{i}^{n} \bigg(\frac{x - \bar{x}}{\sigma_x}\bigg)\bigg(\frac{y - \bar{y}}{\sigma_y}\bigg)   \end{equation}

\begin{equation}   = \frac{1}{(n-1) \cdot \sigma_x \sigma_y} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})   \end{equation}

\begin{equation} \Rightarrow \sigma_{xy}' = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \end{equation}


\end{document} % end main document